ABSTRACT

Disclosed is a method and apparatus for providing fault
tolerance in Totem networks by use of redundant fabrics.
In one embodiment of the invention, this is accomplished by
operating devices on the network in such a way that the
devices mark the token to indicate when the token has been
switched from one fabric to another in response to a
timeout. A ring master device on the network determines,
based on the switching of the token by devices on the
network, whether a fabric or a device on a fabric of the
network has failed. In addition, fabrics that have failed are
monitored to determine when they have become operational.
Retransmission of improperly received messages in
accordance with token-message-order protocols is also
provided for situations in which the token is received before
all messages intended for a given device have been properly
received.

TECHNICAL FIELD

The present invention relates in general to communication 
systems and, more particularly, to the use of redundant 
communication fabrics to enhance fault tolerance in Totem 
communication networks. 

BACKGROUND OF THE INVENTION 

A number of systems have been developed for providing
network communications among groups of users. One such
system comprises a Totem ring network in which a plurality
of devices is connected to a bus network. Each communication
device includes circuitry for interfacing with the Totem ring
network (e.g., transmitting and receiving messages on the
Totem ring network), and a Central Processing Unit (CPU)
adapted for executing processes comprising application
programs effective for managing call processing, database
operations, industrial control, and the like.

A Totem network provides for multicast delivery of
messages, wherein messages can be transmitted and delivered
to multiple locations, with assurance that the sequence in
which messages are generated is maintained as the messages
are transmitted and delivered throughout the system. Totem
networks are well known to those skilled in the art and are
described in greater detail in various technical papers and
articles, such as the article entitled "Totem: A Fault Tolerant
Multicast Group Communication System" by L. E. Moser et al.,
published in Communications of the Association for Computing
Machinery (ACM), Vol. 39, No. 4, April 1996.

In Totem networks, message delivery is controlled using
a token, similar to that used in a token ring system, to
identify which device can transmit onto the network.
Periodically, such as every few milliseconds, the token is
sent around the network to each device in sequence. As the
token is received by each device, the device determines
whether it has a message or data to transmit over the
network. If it does, it sends that data before forwarding the
token. If it does not, it simply forwards the token to the next
device.
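The token discipline described above can be illustrated with a minimal Python sketch. It assumes a single fabric, no message loss, and a global sequence number carried on the token and stamped onto each message (a simplification of how Totem orders messages); the names `Device` and `rotate_token` are hypothetical and are not taken from any Totem implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Device:
    """A ring member holding messages that wait for the token."""
    name: str
    pending: deque = field(default_factory=deque)


def rotate_token(ring, token_seq=0):
    """Pass the token once around `ring`, visiting devices in order.

    A device may transmit only while it holds the token: it sends all of
    its pending messages, stamping each with the next sequence number
    carried on the token, and only then forwards the token.
    Returns (list of (seq, sender, payload), updated token_seq).
    """
    sent = []
    for device in ring:                      # the token visits each device in sequence
        while device.pending:                # transmit before forwarding the token
            token_seq += 1
            sent.append((token_seq, device.name, device.pending.popleft()))
        # moving on to the next loop iteration models forwarding the token
    return sent, token_seq


if __name__ == "__main__":
    ring = [Device("114", deque(["a"])), Device("116"), Device("118", deque(["b", "c"]))]
    print(rotate_token(ring))
```

Running it prints `([(1, '114', 'a'), (2, '118', 'b'), (3, '118', 'c')], 3)`: device 116 has nothing to send and merely forwards the token, and every receiver sees the same total order.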

Conventionally, messages on a Totem network are
transmitted and delivered over a physical medium comprising a
single fabric of wires or fiber optic cable. As a consequence,
while Totem networks assure that messages are transmitted
and delivered in the same sequence in which they are
generated, there is no assurance that the messages will be
delivered at all if a fabric fails. The physical medium of a
Totem network thus has no fault tolerance designed into it.

Accordingly, there is a need for a system and a method
that will provide Totem networks with fault tolerance to
enhance the probability that sequentially transmitted
messages will be delivered across the Totem network.

SUMMARY OF THE INVENTION 

The present invention accordingly provides a Totem
network with multiple redundant fabrics through which
messages can be transmitted and delivered. The Totem network
is configured so that, if one fabric fails, another fabric can be
used, thereby providing a Totem system with fault tolerance.

The Totem network is also configured so that if a failed
fabric has been repaired and thus becomes operational, the
fabric repair can be detected and the repaired fabric declared
operational so that devices on the network can use it. The
Totem network is also configured so that a failure of a device
on the network can be detected.

The present invention further comprises a method
embodied in computer software residing on the network for
controlling the use of the redundant fabrics. The computer
software can be configured to detect when a fabric failure
has occurred and, after a failure has been detected, to
declare the fabric to have failed so that devices on the
network will use only fabrics that are operational. In the
event a failed fabric has been repaired, the computer
software can detect the repair and declare the formerly-failed
fabric operational so that devices on the network can use it.
The computer software can also be configured to detect
when a device on the network has failed.

BRIEF DESCRIPTION OF THE DRAWINGS 

For a more complete understanding of the present 
invention, and the advantages thereof, reference is now 
made to the following descriptions taken in conjunction with 
the accompanying drawings, in which: 

FIG. 1 is a schematic diagram of a Totem ring network 
embodying features of the present invention; 

FIG. 2 depicts a high-level conceptual diagram of a token 
used in the network of FIG. 1; 

FIG. 3 is a flow chart illustrating control logic for marking 
a token used in connection with a ring master device 
connected to the network of FIG. 1 to indicate that a fabric 
of the network has failed; 

FIG. 4 is a flow chart illustrating control logic for marking 
a token used in connection with a ring master device 
connected to the network of FIG. 1 to indicate that a fabric 
previously determined to have failed has become opera- 
tional; and 

FIG. 5 comprises a flow chart illustrating control logic for 
switching fabrics by a communication device connected to 
the network of FIG. 1. 

DETAILED DESCRIPTION 

In the following discussion, numerous specific details are 
set forth to provide a thorough understanding of the present 
invention. However, it will be obvious to those skilled in the 
art that the present invention can be practiced without such 
specific details. In other instances, well-known elements 
have been illustrated in block diagram or schematic form in 
order not to obscure the present invention in unnecessary 
detail. Additionally, for the most part, details concerning the 
operation of Totem ring networks and the like have been 
omitted inasmuch as such details are not necessary to obtain 
a complete understanding of the present invention and are 
within the skills of persons of ordinary skill in the relevant 
art. 

Referring now to FIG. 1 of the drawings, the reference 
numeral 100 generally designates a Totem network embody- 
ing features of the present invention. The Totem network 
100 comprises a plurality of fabrics 101, two of which 
fabrics 102 and 104 are represented by solid-line ellipses in 
FIG. 1, it being understood that the network 100 may 
comprise any number of fabrics greater than or equal to two, 
as indicated by the multiple dashed-line ellipses of FIG. 1.
Each fabric 102 and 104 comprises a physical medium
well-known in the art, such as copper wires, fiber optic
cables, or the like, and may be configured to operate using
a protocol such as 10BASE-T or the like.

A plurality of communication devices well-known in the
art, three of which devices 114, 116, and 118 are depicted in
FIG. 1, are each operably connected to each of the fabrics
102 and 104. At least one of the devices 114, 116, and 118,
taken herein as the device 116, is arbitrarily chosen to serve
as a ring master device of the Totem network. Any of the
devices may act as the ring master; however, if the ring
master fails, another device is chosen to serve as the ring
master. The ring master device 116 manages tokens and
messages to determine, based on fabric switching by devices
114, 116, and 118, whether a failure of any of the plurality
of fabrics has occurred. The ring master device 116 also
contains a local fabric switch count register 120 for each
fabric that holds the number of consecutive fabric switches
for the fabric.

The devices 114, 116, and 118 may comprise any
conventional computer generally capable of receiving,
storing, processing, and outputting data. Each of the
devices 114, 116, and 118 is configured to switch
transmission of a token among the plurality of fabrics in
response to a timeout after transmission of a token. While
not shown in detail, the devices 114, 116, and 118 include
components such as input and output devices, volatile and
non-volatile memory, and the like; because such computer
components are well known in the art, they are not shown or
described in further detail herein.

In FIG. 2, the reference numeral 200 generally designates
a token comprising a plurality of data fields, three of which
fields 202, 204, and 206 are shown in FIG. 2, it being
understood that the token 200 may comprise any number of
data fields. The data field 202 comprises normal token data
used in tokens on prior art Totem ring networks, including
data identifying which device 114, 116, or 118 the token 200
is intended for, which token data is well known in the art.
The data field 204 comprises information denoting which
fabrics of the network 100 have failed, as determined by the
ring master device 116. The data field 206 comprises
information denoting the number of times that a device 114,
116, or 118 of the network 100 has switched from one of the
fabrics 102 and 104 in response to a timeout following
transmission of a token 200.

FIGS. 3-5 are flowcharts of control logic implemented by
the devices 114, 116, and 118 for managing the plurality of
fabrics 102 and 104 in accordance with the present
invention.

FIG. 3 is a flow chart of control logic that can be
implemented on the ring master device 116 to operate as a
failed-fabric detector in accordance with the present
invention. The control logic will be exemplified by showing
how a fabric failure is detected by the ring master device
116, resulting in the marking of the token 200 to indicate
that a fabric 102 or 104 has failed.

In step 302, the ring master device 116 receives the token
200 from the device 118 on fabric 102. Execution then
proceeds to step 304. In step 304, the ring master device 116
determines whether the token 200 is intended for the ring
master device 116. If the ring master device 116 determines
that the token 200 is intended for it, execution proceeds to
step 308. If the ring master device 116 determines that the
token 200 is not intended for it, execution proceeds to step
306 and terminates.

In step 308, a determination is made whether a token
fabric switch count, stored in the field 206 of the token 200,
is equal to zero. The token fabric switch count 206 is
incremented when a timeout occurs after a device 114, 116,
or 118 has transmitted the token 200 on one of the fabrics
102 or 104 and has not received the token 200 within a
predetermined amount of time thereafter, such as within one
rotation of the token 200 around the network 100. The token
fabric switch count 206 is set to zero at step 320 by the ring
master device 116 every rotation of the token on the network
100. The token fabric switch count 206 thus stores the
number of times a fabric switch for a particular fabric 102 or
104 has occurred during a token rotation around the network
100.

If, in step 308, the token fabric switch count 206 is equal
to zero, execution proceeds to step 309. If the token fabric
switch count 206 is not equal to zero, execution proceeds to
step 312.

In step 309, the local fabric switch count for the fabric
the token was originally sent on is set to zero, because the
token made a successful pass through the fabric without any
retransmissions.

In step 312, a determination is made whether the total
local fabric switch count 120 for the fabric 102 exceeds a
predetermined number, such as 3. Other algorithms may be
used for detecting poorly performing fabrics, such as
counting the number of tokens that have been dropped during
the immediately preceding few seconds. The total local fabric
switch count 120 is stored by the ring master device 116 and
reflects the number of token 200 rotations during which a
token 200 transmitted by the ring master 116 on the fabric
102 has been switched by another device 114 or 118 to the
fabric 104.

If, in step 312, the predetermined number of switches has
not been exceeded, execution proceeds to step 314. In step
314, the total local fabric switch count 120 for fabric 102 is
incremented, and execution then proceeds to step 310. In
step 310, the token 200 is processed in a well-known manner
in accordance with conventional Totem ring network
technology.

If, in step 312, more than the predetermined number of
switches have occurred for the fabric 102, execution
proceeds to step 316. In step 316, the token 200 is marked to
indicate that fabric 102 has failed. From step 316, execution
proceeds to step 318. In step 318, the token fabric switch
count 206 is set to zero. Execution then proceeds to step 310,
discussed above. From step 310, execution proceeds to step
320, wherein the token fabric switch count 206 is set to zero.
From step 320, execution proceeds to step 322, wherein the
token is transmitted on the next non-failed fabric. For
example, if the token 200 had been transmitted on the fabric
102, in step 322, the token 200 may next be transmitted on
the fabric 104. Upon completion of step 322, execution
proceeds to step 324 and terminates.

FIG. 4 is a flow chart of control logic that can be
implemented on the ring master device 116 to permit it to
operate as a detector of a fabric 102 or 104 that has failed
and has subsequently become operational. The control logic
will be exemplified by showing how a formerly-failed fabric
that has now become operational is detected by the ring
master device 116, resulting in the marking of a token 200 to
indicate that the formerly-failed fabric 102 or 104 is now
operational.

Referring to FIG. 4, execution is initiated at step 401 and
proceeds to step 402, wherein a determination is made
whether the token 200 was marked in step 316 to indicate
that the fabric 102 or 104 has failed. If it is determined that
a fabric 102 or 104 has failed, execution proceeds to
step 406; otherwise, execution terminates at step 404.
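The fields of the token 200 of FIG. 2 and the ring master logic of FIG. 3 described above can be sketched in Python as follows. This is a minimal, single-process model under stated assumptions: fabrics and devices are identified by their reference numerals as strings, step 310 (conventional token processing) is left as a comment, execution after step 309 is assumed to continue at step 310, and "the next non-failed fabric" in step 322 is interpreted as a round-robin choice. The names `Token`, `RingMaster`, `on_token`, and `next_fabric` are hypothetical.

```python
from dataclasses import dataclass, field

SWITCH_THRESHOLD = 3  # "a predetermined number, such as 3" (step 312)


@dataclass
class Token:
    """Token 200, reduced to the fields discussed for FIG. 2."""
    destination: str                                   # from field 202
    failed_fabrics: set = field(default_factory=set)   # field 204
    fabric_switch_count: int = 0                       # field 206


class RingMaster:
    """Ring master device 116 acting as the FIG. 3 failed-fabric detector."""

    def __init__(self, name, fabrics):
        self.name = name
        self.fabrics = list(fabrics)
        # register 120: one local fabric switch count per fabric
        self.local_switch_count = {f: 0 for f in self.fabrics}

    def on_token(self, token, sent_on):
        """Handle a returning token that this device originally sent on `sent_on`.

        Returns the fabric for the next transmission (step 322), or None
        when the token is not addressed to this device (step 306).
        """
        if token.destination != self.name:                  # step 304
            return None                                     # step 306
        if token.fabric_switch_count == 0:                  # step 308
            self.local_switch_count[sent_on] = 0            # step 309 (assumed to continue at 310)
        elif self.local_switch_count[sent_on] > SWITCH_THRESHOLD:  # step 312
            token.failed_fabrics.add(sent_on)               # step 316: mark fabric failed
            token.fabric_switch_count = 0                   # step 318
        else:
            self.local_switch_count[sent_on] += 1           # step 314
        # step 310: conventional Totem token processing would happen here
        token.fabric_switch_count = 0                       # step 320
        return self.next_fabric(sent_on, token.failed_fabrics)  # step 322

    def next_fabric(self, current, failed):
        """Round-robin to the next fabric not marked failed (may wrap to `current`)."""
        i = self.fabrics.index(current)
        for f in self.fabrics[i + 1:] + self.fabrics[:i + 1]:
            if f not in failed:
                return f
        return None  # every fabric is marked failed
```

With a threshold of 3, the check in step 312 happens before the increment in step 314, so the fifth consecutive switched rotation on fabric 102 is the first at which the local count (then 4) exceeds the threshold; at that point the token is marked with fabric 102 as failed and is next sent on fabric 104.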

In step 406, the ring master transmits a test message on the
fabric 102 or 104 on which a failure has been detected. The
test message is transmitted around the fabric by the devices
on the network in the same order as a normal token. The ring
master receives and retransmits the test message and counts
the number of times it has done so. Execution proceeds to
step 407, where the ring master waits for the test message to
go around the fabric a preferably large number of times (e.g.,
100). Execution then proceeds to step 408, wherein a
determination is made whether a timeout has occurred (i.e.,
whether the test message failed to go around the fabric in
time), a timeout being defined as the expiration of a
predetermined time period that would normally allow the test
message to go around the ring the required number of times,
without the ring master device 116 having received a
response to the test message on the failed fabric 102 or 104.
If the predetermined time period has been exceeded,
execution returns to step 406. If, in step 408, it is determined
that the predetermined time period has not been exceeded,
execution continues to step 410. In step 410, the token 200
is marked to indicate that the fabric 102 or 104 that had
failed is now operational and available for use by devices
114, 116, and 118. Upon completion of step 410, execution
terminates at step 412.
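The recovery check of FIG. 4 can be sketched as below. The transport is abstracted behind a hypothetical callable `send_test_lap(fabric)` that returns `True` when the test message completes one lap of the fabric. `REQUIRED_LAPS` uses the example value of 100 from the text, while `time_budget` and the retry bound `max_attempts` are assumptions of this sketch (the flow chart itself simply loops back to step 406 after each timeout).

```python
import time
from dataclasses import dataclass, field

REQUIRED_LAPS = 100  # example value given for step 407


@dataclass
class Token:
    failed_fabrics: set = field(default_factory=set)  # field 204


def check_fabric_repaired(token, fabric, send_test_lap, time_budget, max_attempts=10):
    """FIG. 4 sketch: return True once `fabric` is marked operational again.

    send_test_lap(fabric) is a hypothetical hook that circulates the test
    message once around `fabric` and returns True if it came back.
    time_budget is the period in which REQUIRED_LAPS laps should finish
    on a healthy fabric.
    """
    if fabric not in token.failed_fabrics:          # step 402
        return False                                # step 404
    for _ in range(max_attempts):                   # retry bound: an assumption
        start = time.monotonic()
        laps = 0
        while laps < REQUIRED_LAPS and time.monotonic() - start <= time_budget:
            if send_test_lap(fabric):               # steps 406-407: count completed laps
                laps += 1
        timed_out = laps < REQUIRED_LAPS or time.monotonic() - start > time_budget
        if not timed_out:                           # step 408
            token.failed_fabrics.discard(fabric)    # step 410: fabric usable again
            return True
        # timeout: back to step 406 for another round of test laps
    return False
```

Measuring the deadline with `time.monotonic()` keeps the timeout immune to wall-clock adjustments; a real device would more likely arm a timer and react to received test messages rather than poll.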

FIG. 5 is a flow chart of control logic that may be 
implemented on devices 114 and 118 to permit them to 
operate as a fabric switch in accordance with the present 
invention. The control logic will be exemplified by showing 
how the fabric 102 or 104 on which a token 200 is trans- 
mitted may be switched by a device 114 or 118 and a token 
fabric switch count 206 incremented in response to detection 
of a timeout. 

Referring to FIG. 5, execution is initiated in step 501 and 
proceeds to step 502, wherein a token 200 is sent by a device 
114 or 118 on a fabric 102 or 104. Execution then proceeds 
to step 504. 

In step 504, a determination is made whether a timeout 
has occurred, a timeout occurring when a predetermined 
amount of time (a timeout value) has elapsed before device 
114 or 118 has received a token 200. Such timeout value is 
set to the worst-case time it would take the token to go 
around the ring under normal operation. If it is determined 
that a timeout has not occurred, execution terminates at step 
506. If, in step 504, it is determined that a timeout has 
occurred, execution proceeds to step 508. In step 508, a 
token fabric switch count 206 is incremented. The token 
fabric switch count counts the number of times that trans- 
mission of the token 200 has been switched from one of the 
fabrics 102 or 104 to another of the fabrics 102 or 104. 
Execution then proceeds to step 510. 

In step 510, the device 114 or 118 switches transmission of
the token to another non-failed fabric 102 or 104, depending
on the fabric 102 or 104 on which the device 114 or 118
received the token. Execution then returns to step 502.
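A sketch of the FIG. 5 fabric switch on a device such as 114 or 118 follows. `send_and_wait(fabric, token, timeout)` is a hypothetical transport hook that sends the token on a fabric and reports whether it came back before the timeout (which, per the text, would be set to the worst-case ring rotation time). The round-robin choice of the next fabric and the `max_tries` bound are assumptions of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    failed_fabrics: set = field(default_factory=set)  # field 204
    fabric_switch_count: int = 0                       # field 206


def next_non_failed(fabrics, current, failed):
    """Next fabric after `current` that is not marked failed (wraps, skips `current`)."""
    i = fabrics.index(current)
    for f in fabrics[i + 1:] + fabrics[:i]:
        if f not in failed:
            return f
    return None


def forward_token(token, fabrics, fabric, send_and_wait, timeout, max_tries=8):
    """FIG. 5 sketch: send the token, switching fabrics after each timeout.

    Returns the fabric on which the token finally got through, or None.
    """
    for _ in range(max_tries):                       # bound added for the sketch
        if send_and_wait(fabric, token, timeout):    # steps 502-504
            return fabric                            # step 506: no timeout
        token.fabric_switch_count += 1               # step 508
        fabric = next_non_failed(fabrics, fabric, token.failed_fabrics)  # step 510
        if fabric is None:                           # no other usable fabric
            return None
    return None
```

Each increment in step 508 is what the ring master later observes in field 206 when it runs the FIG. 3 logic.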

By the practice of the present invention, fault tolerance of
Totem ring networks is provided, which enhances the
probability that sequentially-transmitted messages will be
properly delivered across the Totem ring network. Because
there are multiple redundant fabrics on which tokens and
messages may be transmitted, in the event one or more of the
fabrics fails, tokens and messages can still be transmitted on
the network. Because the present invention also provides for
detection of the repair of a failed fabric, once a
formerly-failed fabric becomes operational, the network is
alerted that the fabric is now operational and devices on the
network are able to use the fabric, thus resulting in increased
bandwidth and fault tolerance of the network.

It is understood that the present invention can take many
forms and embodiments. Accordingly, several variations
may be made in the foregoing without departing from the
spirit or the scope of the invention. For example, any number
of fabrics, devices, and ring master devices can be used, so
long as multiple fabrics provide redundancy in the Totem
network.

Other methods may be employed to determine that a 
specific fabric has failed. For example, the number of times 
a token switch has occurred on a specific fabric over a period 
of time may be counted, or a device at which failures 
occurred may be recorded, to more accurately identify 
poorly performing fabrics and to report the location of 
failure more accurately. 
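One way to realize the time-windowed alternative mentioned above is a sliding-window counter per fabric that also records the device at which each switch occurred. This is a hedged sketch: the class name `SwitchRateMonitor`, the window length, and the limit are hypothetical, and timestamps are passed in explicitly so the logic can be exercised without a real clock.

```python
from collections import deque


class SwitchRateMonitor:
    """Counts token switches per fabric within a sliding time window.

    A fabric is flagged as poorly performing when more than `limit`
    switches were recorded in the last `window` seconds.
    """

    def __init__(self, window, limit):
        self.window = window
        self.limit = limit
        self.events = {}    # fabric -> deque of (timestamp, device)

    def record(self, fabric, now, device=None):
        """Log one token switch away from `fabric`, optionally noting the device."""
        self.events.setdefault(fabric, deque()).append((now, device))

    def _expire(self, fabric, now):
        """Drop events older than the window and return what remains."""
        q = self.events.get(fabric, deque())
        while q and now - q[0][0] > self.window:
            q.popleft()
        return q

    def is_suspect(self, fabric, now):
        return len(self._expire(fabric, now)) > self.limit

    def hot_spots(self, fabric, now):
        """Devices at which recent switches occurred, most frequent first."""
        counts = {}
        for _, dev in self._expire(fabric, now):
            if dev is not None:
                counts[dev] = counts.get(dev, 0) + 1
        return sorted(counts, key=counts.get, reverse=True)
```

Unlike the consecutive-switch register 120, a windowed count forgets old switches gradually, and `hot_spots` gives a rough indication of where on the fabric the trouble lies.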

Other fabric recovery mechanisms may also be employed. 
For example, a response may be individually requested from 
each device in the network. 

For improved performance in the event of a failure, tokens
and messages may be sent on multiple fabrics, or on all
fabrics simultaneously, so that if a token is lost on one fabric,
it may be received on another fabric.
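Sending on several fabrics at once means a receiver may see the same token or message more than once. A minimal sketch of receiver-side duplicate suppression follows, assuming each copy carries a monotonically increasing sequence number (an assumption of this sketch; the text does not specify how duplicates are recognized). `DuplicateFilter`, `send_on_all`, and the `transmit` hook are hypothetical names.

```python
class DuplicateFilter:
    """Receiver-side filter: deliver each sequence number once, in order."""

    def __init__(self):
        self.delivered_up_to = 0   # highest sequence number delivered so far
        self.held = {}             # seq -> payload received ahead of a gap

    def receive(self, seq, payload):
        """Accept a copy from any fabric; return payloads now deliverable in order."""
        if seq <= self.delivered_up_to or seq in self.held:
            return []              # duplicate copy that arrived via another fabric
        self.held[seq] = payload
        ready = []
        while self.delivered_up_to + 1 in self.held:
            self.delivered_up_to += 1
            ready.append(self.held.pop(self.delivered_up_to))
        return ready


def send_on_all(fabrics, seq, payload, transmit):
    """Send the same copy on every fabric via the hypothetical `transmit` hook."""
    for fabric in fabrics:
        transmit(fabric, seq, payload)
```

Holding early arrivals until the gap fills preserves the sequential delivery that Totem guarantees, even when the fabrics deliver their copies in different orders.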

Having thus described the present invention by reference 
to certain of its preferred embodiments, it is noted that the 
embodiments disclosed are illustrative rather than limiting in 
nature and that a wide range of variations, modifications, 
changes, and substitutions are contemplated in the foregoing 
disclosure and, in some instances, some features of the 
present invention may be employed without a corresponding 
use of the other features. Many such variations and modi- 
fications may be considered obvious and desirable by those 
skilled in the art based upon a review of the foregoing 
description of preferred embodiments. Accordingly, it is 
appropriate that the appended claims be construed broadly 
and in a manner consistent with the scope of the invention. 
What is claimed is: 

1. A method for providing fault tolerance in a Totem 
network, comprising the steps performed by a device oper- 
ably connected on the network of: 

receiving a token transmitted on a first fabric of a plurality
of fabrics of the network;
determining whether the number of times that a token has 
been switched from the first fabric to a second fabric of 
the plurality of fabrics exceeds a predetermined num- 
ber; 

upon a determination that the number of times that a token 
has been switched exceeds a predetermined number, 
marking the token to indicate that at least one of the 
plurality of fabrics has failed; and 
setting the number of fabric switches stored on the token 
to zero in response to the indication that at least one of 
the fabrics has failed. 

2. A method for providing fault tolerance in a Totem 
network, comprising the steps performed by a device oper- 
ably connected on the network of: 

receiving a token transmitted on a first fabric of a plurality
of fabrics of the network;
determining whether the number of times that a token has 
been switched from the first fabric to a second fabric of 
the plurality of fabrics exceeds a predetermined num- 
ber; 

upon a determination that the number of times that a token 
has been switched exceeds a predetermined number, 
marking the token to indicate that at least one of the 
plurality of fabrics has failed; 
setting a fabric switch count stored on the token to zero in
response to the indication that at least one of the fabrics 
has failed; and 

transmitting the token on a second fabric of the plurality
of fabrics of the network.

3. A Totem ring master device comprising: 

a processor for processing messages and tokens; 

at least one first interface connectable to each fabric of a 
plurality of fabrics comprising a Totem ring network;

means for determining whether the number of times the 
token has been switched from a first fabric to a second 
fabric of the plurality of fabrics exceeds a predeter- 
mined number, thereby indicating that at least one of 
the plurality of fabrics has failed; and

means for setting the number of fabric switches stored on 
the token to zero in response to the indication that at 
least one of the fabrics has failed. 

4. A Totem ring master device comprising: 

a processor for processing messages and tokens; 

at least one first interface connectable to each fabric of a 
plurality of fabrics comprising a Totem ring network; 

means for determining whether the number of times the 
token has been switched from a first fabric to a second 
fabric of the plurality of fabrics exceeds a predeter- 
mined number, thereby indicating that at least one of 
the plurality of fabrics has failed; 

means for setting a fabric switch count stored on the token 
to zero in response to the indication that at least one of 
the fabrics has failed; and 

means for transmitting the token on a second fabric of the 
plurality of fabrics of the network. 

***** 